EXTENDED ABSTRACT



Abstract 

We craft a few scenarios for the execution of sequential and parallel jobs on future-generation
machines. Checkpointing or migration: which technique to choose?

1 Introduction 

From fault-tolerance to resilience. Large machines are subject to failures, and applications will face
resource faults during execution. Fortunately, failure prediction is there to help. For instance, the system
will receive an alarm when a disk or CPU becomes unusually hot. In that case, the application must
dynamically take action to prepare for, and recover from, the expected failure. The goal is to compare
two well-known strategies:

• Checkpointing: purely local, but can be very costly 

• Migration: requires availability of a spare resource 

Finally, we assess the cost of periodically checkpointing parallel jobs in the absence of failure prediction.

2 Notations 

• C: checkpoint save time (in minutes) 

• R: checkpoint recovery time (in minutes) 

• D: down/reboot time (in minutes) 

• M: migration time (in minutes) 

• N: total number of cluster nodes 

• $\mu$: the mean time between failures (e.g., $1/\lambda$ if the failures are exponentially distributed)

Obviously, the checkpointing/migration comparison makes sense only if M < C + D + R; otherwise
it is better to use the faulty machine as its own spare. Techniques such as live migration [3] allow for
migrating without any disk access, thereby dramatically reducing migration time.



3 Sequential jobs 



3.1 Checkpointing 

We checkpoint just in time before the failure. Each resource is unavailable during C + D + R time-steps,
and this happens every $\mu$ time-steps on average. Hence the global throughput is

$$\rho_{cp} = \frac{\mu}{\mu + C + D + R} \times N$$

3.2 Migration 

Let us assume we keep m of the N nodes as spares. We need to ensure that we are never short of a 
spare machine. We encounter a problem in the execution if there are more than m resources that are 
engaged in migration or rebooting. The probability that, at a given time, a machine is not migrating or 
rebooting is: 

$$u = \frac{\mu}{\mu + M + D},$$

and that it is migrating or rebooting is:

$$v = \frac{M + D}{\mu + M + D}.$$

Therefore, the probability that we do not encounter a problem is:

$$\mathrm{success}(m) = \sum_{k=0}^{m} \binom{N}{k} u^{N-k} v^{k}.$$



So we need to find the right percentage of spare machines, say $m = \alpha(\varepsilon) N$, that "guarantees" a
successful execution with probability at least $1 - \varepsilon$. Unfortunately, the expression for success(m) does not
allow for solving the equation $\mathrm{success}(m) \ge 1 - \varepsilon$ analytically. It must therefore be solved numerically.
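The numerical search for the number of spares can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the function name `min_spares` and its argument names are hypothetical, all times are assumed to share one unit (minutes here), and the binomial sum is accumulated in log space so that the $u^N$ term does not underflow for large N.

```python
import math


def min_spares(n_nodes, mu, migration, down, eps):
    """Smallest m with success(m) >= 1 - eps, or None if no m <= N works.

    success(m) = sum_{k=0}^{m} C(N, k) * u**(N - k) * v**k, where
    v = (M + D) / (mu + M + D) and u = 1 - v.
    """
    v = (migration + down) / (mu + migration + down)
    u = 1.0 - v
    # Log of the k = 0 term, u**N; the raw power underflows for large N.
    log_term = n_nodes * math.log(u)
    log_ratio = math.log(v) - math.log(u)
    cdf = 0.0
    for k in range(n_nodes + 1):
        cdf += math.exp(log_term)
        if cdf >= 1.0 - eps:
            return k
        if k < n_nodes:
            # Binomial recurrence: term(k+1) = term(k) * (N - k)/(k + 1) * v/u.
            log_term += math.log((n_nodes - k) / (k + 1)) + log_ratio
    return None
```

For instance, `min_spares(2**14, 1440.0, 1.0, 2.5, 1e-4)` evaluates the criterion for N = 2^14 nodes, a mean time between failures of one day expressed in minutes, M = 1 and D = 2.5.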

Note that $\binom{N}{k} \ge (N/k)^k$. Therefore,

$$\mathrm{success}(m) \ge \sum_{k=0}^{m} (N/k)^{k} u^{N-k} v^{k}$$

(with the $k = 0$ term read as $u^N$), which may be a bit easier to use for numerically solving the
equation, and leads to an overestimation of the number of spares for achieving a probability of
success $1 - \varepsilon$.

Given m spares, the global throughput is

$$\rho_{mig} = \frac{\mu}{\mu + M} \times (N - m)$$

Remark 1. When there is a problem with migration, it does mean that the execution fails, because we 
cannot find a spare to replace a machine that goes down, and at that moment, it is too late to checkpoint. 
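The two throughput expressions for sequential jobs can be compared directly. The sketch below is illustrative only: the function names are hypothetical, the number of spares m is taken as an input, and the "improvement" is assumed to be the relative gain of migration over checkpointing, 100 × (ρ_mig / ρ_cp − 1).

```python
def rho_checkpoint_seq(n_nodes, mu, ckpt, down, recov):
    """Checkpointing: each node is useful mu out of every mu + C + D + R."""
    return mu / (mu + ckpt + down + recov) * n_nodes


def rho_migration_seq(n_nodes, n_spares, mu, migration):
    """Migration: N - m working nodes, each useful mu out of every mu + M."""
    return mu / (mu + migration) * (n_nodes - n_spares)


def improvement_percent(rho_mig, rho_cp):
    """Relative gain of migration over checkpointing, in percent (assumed metric)."""
    return 100.0 * (rho_mig / rho_cp - 1.0)
```

A positive value means migration delivers more useful work than checkpointing; a negative value means the nodes set aside as spares cost more than the time saved per failure.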



4 Parallel jobs 
4.1 Distribution 

The number of processors required by a typical job obeys a strange distribution, which is a two-stage
log-uniform distribution biased to powers of two, see [5]. We assume something similar but simpler:

• let $N = 2^Z$ for simplicity

• the probability that a job is sequential is $\alpha_0 = p_1 \approx 0.25$

• otherwise, the job is parallel, and the probability that it uses $2^j$ processors is independent of $j$ and
equal to $\alpha_j = (1 - p_1) \times \frac{1}{Z}$ for $1 \le j \le Z = \log_2 N$

We assume a steady-state utilization of the whole platform, where all processors are active all the
time, and where the proportion of jobs using any given number of processors remains constant. At any
time-step, the expectation of the number of jobs that use exactly $2^j$ processors is $\beta_j$ for $0 \le j \le Z$. The
expectation of the total number of jobs running is $K$. We have:

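$$K = \sum_{j=0}^{Z} \beta_j \qquad (1)$$

$$\beta_j = \alpha_j K \quad \text{for } 0 \le j \le Z \qquad (2)$$

$$N = \sum_{j=0}^{Z} 2^j \beta_j \qquad (3)$$

We derive

$$\frac{N}{K} = \sum_{j=0}^{Z} 2^j \alpha_j = p_1 + \frac{1 - p_1}{Z} \sum_{j=1}^{Z} 2^j = p_1 + \frac{1 - p_1}{Z} (2N - 2),$$

hence the value of $K$, and then that of all the $\beta_j$.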
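As a concrete illustration, the job mix can be computed as follows. This is a sketch under the simplified distribution stated above; the function name `job_mix` is hypothetical.

```python
def job_mix(z, p1=0.25):
    """Return (K, beta) for N = 2**z processors.

    alpha[0] = p1 (sequential jobs); alpha[j] = (1 - p1) / z for 1 <= j <= z.
    beta[j] = alpha[j] * K is the expected number of running jobs of size 2**j.
    """
    n_procs = 2 ** z
    alpha = [p1] + [(1.0 - p1) / z] * z
    # Equation (3) with beta_j = alpha_j * K gives N = K * sum_j 2**j * alpha_j.
    procs_per_job = sum(2 ** j * a for j, a in enumerate(alpha))
    k_jobs = n_procs / procs_per_job
    beta = [a * k_jobs for a in alpha]
    return k_jobs, beta
```

Because the largest job sizes dominate the processor count, K stays small even for large N: most processors are held by a handful of very wide jobs.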



4.2 Checkpointing 

If a job uses two processors, then the expected interval time between failures is $\mu/2$. This is because the
minimum of two identical exponential laws is exponential with a doubled parameter. More generally,
let us call $\mu_k$ the mean of the minimum of $2^k$ i.i.d. variables. If the variables are exponentially distributed,
with rate parameter $\lambda$, then $\mu_k = 1/(\lambda 2^k)$. If the variables are Weibull, with scale parameter $\lambda$ and
shape parameter $a$, then $\mu_k = \lambda \Gamma(1 + 1/(a 2^k))$.
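The exponential case follows from a one-line survival-function argument, sketched here for $n = 2^k$ i.i.d. lifetimes $X_1, \dots, X_n$ with rate $\lambda$ (so that $\mu = 1/\lambda$):

```latex
\[
  \Pr\Bigl[\min_{1 \le i \le n} X_i > t\Bigr]
  = \prod_{i=1}^{n} \Pr[X_i > t]
  = \bigl(e^{-\lambda t}\bigr)^{n}
  = e^{-n \lambda t},
  \qquad\text{so}\qquad
  \mu_k = \frac{1}{n \lambda} = \frac{1}{\lambda 2^{k}} = \frac{\mu}{2^{k}} .
\]
```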

For $0 \le k \le Z$, there are $\beta_k \times 2^k$ processors running jobs with $2^k$ parallel tasks, whose expected
interval time between failures is therefore $\mu_k$. The throughput is given as:

$$\rho_{cp} = \sum_{k=0}^{Z} \beta_k \times 2^k \times \frac{\mu_k}{\mu_k + C + D + R}.$$

For the exponential distribution, this becomes:

$$\rho_{cp} = \sum_{k=0}^{Z} \beta_k \times 2^k \times \frac{1}{1 + 2^k \lambda (C + D + R)}.$$
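The checkpointing throughput for parallel jobs can be evaluated numerically. The sketch below assumes exponentially distributed failures, so that a job on 2^k processors has mean time between failures μ / 2^k, and recomputes the job mix of Section 4.1 inline; the function name is hypothetical.

```python
def rho_checkpoint_par(z, mu, ckpt, down, recov, p1=0.25):
    """Checkpointing throughput for the job mix of Section 4.1, N = 2**z.

    Assumes exponential failures: a job on 2**k nodes has mu_k = mu / 2**k.
    """
    alpha = [p1] + [(1.0 - p1) / z] * z
    k_jobs = 2 ** z / sum(2 ** j * a for j, a in enumerate(alpha))
    overhead = ckpt + down + recov
    total = 0.0
    for k, a in enumerate(alpha):
        beta_k = a * k_jobs          # expected number of jobs of size 2**k
        mu_k = mu / 2 ** k           # their mean time between failures
        total += beta_k * 2 ** k * mu_k / (mu_k + overhead)
    return total
```

With zero overhead the function returns N, as expected; as the overhead grows, the widest jobs lose throughput first because their μ_k is smallest.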



4.3 Migration 

The probability of running OK is the same as for independent jobs: 



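$$\mathrm{success}(m) = \sum_{k=0}^{m} \binom{N}{k} u^{N-k} v^{k}.$$

Because there are only $N - m$ machines "really" available, we scale the throughput by the factor
$(N - m)/N$. The global throughput now becomes

$$\rho_{mig} = \left( \sum_{k=0}^{Z} \beta_k \times 2^k \times \frac{\mu_k}{\mu_k + M} \right) \times \frac{N - m}{N}$$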
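The comparison for parallel jobs can be sketched in the same way. The code below is an illustration under stated assumptions, not the authors' implementation: failures are exponential (μ_k = μ / 2^k), the number of spares m is supplied by the caller, the improvement is taken to be 100 × (ρ_mig / ρ_cp − 1), and the function names are hypothetical.

```python
def rho_par(z, mu, overhead, p1=0.25):
    """sum_k beta_k * 2**k * mu_k / (mu_k + overhead), with mu_k = mu / 2**k."""
    alpha = [p1] + [(1.0 - p1) / z] * z
    k_jobs = 2 ** z / sum(2 ** j * a for j, a in enumerate(alpha))
    return sum(a * k_jobs * 2 ** k * (mu / 2 ** k) / (mu / 2 ** k + overhead)
               for k, a in enumerate(alpha))


def improvement_par(z, n_spares, mu, ckpt, down, recov, migration):
    """Percentage gain of migration over checkpointing for parallel jobs."""
    n_nodes = 2 ** z
    rho_cp = rho_par(z, mu, ckpt + down + recov)
    # Only N - m machines do useful work, hence the (N - m) / N scaling.
    rho_mig = rho_par(z, mu, migration) * (n_nodes - n_spares) / n_nodes
    return 100.0 * (rho_mig / rho_cp - 1.0)
```

A call such as `improvement_par(14, 65, 1440.0, 25.0, 2.5, 25.0, 1.0)` uses the "Today" parameters with μ = 1 day in minutes and can be compared with the first parallel entry of Table 1.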



5 Numerical Results 

In this section we present numerical results to understand the impact of checkpointing vs. migration 
under a number of scenarios, both in the "all sequential" case and in the "parallel jobs" case. All results 
are in percentage improvement of migration over checkpointing (negative or positive values). 
All results use the following values:

• $\mu$ = 1 day, 1 week, 1 month, 1 year;

• $N$ = 10,000, 100,000, 1,000,000 (the tables report the nearby powers of two $2^{14}$, $2^{17}$, $2^{20}$);

• $\varepsilon = 10^{-4}, 10^{-6}$;

and with particular values of C = R, M, and D in the following scenarios.

Table 1: "Today" scenario: C = 25, D = 2.5, M = 1. Percentage improvement of migration over
checkpointing. Number of required spares in parentheses.

| $\mu$ | $N$ | Sequential jobs, $\varepsilon = 10^{-4}$ | Sequential jobs, $\varepsilon = 10^{-6}$ | Parallel jobs, $\varepsilon = 10^{-4}$ | Parallel jobs, $\varepsilon = 10^{-6}$ |
|---|---|---|---|---|---|
| 1 day | $2^{14}$ | 2.75 (65) | 2.65 (73) | 2081.37 (65) | 2080.30 (73) |
| 1 day | $2^{17}$ | 2.96 (386) | 2.93 (406) | 2760.06 (386) | 2759.62 (406) |
| 1 day | $2^{20}$ | 3.03 (2732) | 3.02 (2786) | 3200.37 (2732) | 3200.20 (2786) |
| 1 week | $2^{14}$ | 0.31 (16) | 0.27 (20) | 1196.77 (16) | 1196.45 (20) |
| 1 week | $2^{17}$ | 0.40 (73) | 0.39 (81) | 2158.28 (73) | 2158.15 (81) |
| 1 week | $2^{20}$ | 0.43 (437) | 0.42 (458) | 2824.48 (437) | 2824.42 (458) |
| 1 month | $2^{14}$ | -0.02 (3) | -0.04 (5) | 136.28 (3) | 136.25 (5) |
| 1 month | $2^{17}$ | 0.00 (8) | 0.00 (10) | 609.60 (8) | 609.59 (10) |
| 1 month | $2^{20}$ | 0.01 (27) | 0.01 (32) | 1575.36 (27) | 1575.35 (32) |
| 1 year | $2^{14}$ | -0.02 (2) | -0.02 (2) | 14.81 (2) | 14.81 (2) |
| 1 year | $2^{17}$ | -0.00 (3) | -0.00 (4) | 97.57 (3) | 97.57 (4) |
| 1 year | $2^{20}$ | 0.00 (6) | -0.00 (9) | 471.29 (6) | 471.29 (9) |

5.1 Scenario "today" 

• $C = R \in [20, 30]$

• $D \in [1.5, 5]$

• $M \in [0.5, 1.5]$ (32 GB on a 10 Gbps network)

Results are given in Table 1 for particular values in the above ranges.

5.2 Scenario "2011 HD"

• $C = R \in [5, 10]$

• $D \in [1.5, 5]$

• $M \in [0.5, 1.5]$ (64 GB on a 20 Gbps network)

Results are given in Table 2 for particular values in the above ranges.

5.3 Scenario "2011 SSD"

• $C = R \in [4, 6]$

• $D \in [1.5, 5]$

• $M \in [0.5, 1.5]$ (64 GB on a 20 Gbps network)

Results are given in Table 3 for particular values in the above ranges.






Table 2: "2011 HD" scenario: C = 7.5, D = 2.5, M = 1. Percentage improvement of migration over
checkpointing. Number of required spares in parentheses.

| $\mu$ | $N$ | Sequential jobs, $\varepsilon = 10^{-4}$ | Sequential jobs, $\varepsilon = 10^{-6}$ | Parallel jobs, $\varepsilon = 10^{-4}$ | Parallel jobs, $\varepsilon = 10^{-6}$ |
|---|---|---|---|---|---|
| 1 day | $2^{14}$ | 0.34 (65) | 0.25 (73) | 773.24 (65) | 772.81 (73) |
| 1 day | $2^{17}$ | 0.55 (386) | 0.52 (406) | 995.63 (386) | 995.46 (406) |
| 1 day | $2^{20}$ | 0.62 (2732) | 0.61 (2786) | 1131.29 (2732) | 1131.23 (2786) |
| 1 week | $2^{14}$ | -0.03 (16) | -0.08 (20) | 458.73 (16) | 458.59 (20) |
| 1 week | $2^{17}$ | 0.05 (73) | 0.04 (81) | 796.68 (73) | 796.63 (81) |
| 1 week | $2^{20}$ | 0.08 (437) | 0.08 (458) | 1012.44 (437) | 1012.42 (458) |
| 1 month | $2^{14}$ | -0.03 (3) | -0.06 (5) | 50.04 (3) | 50.02 (5) |
| 1 month | $2^{17}$ | -0.01 (8) | -0.01 (10) | 236.64 (8) | 236.64 (10) |
| 1 month | $2^{20}$ | 0.00 (27) | -0.00 (32) | 595.00 (27) | 595.00 (32) |
| 1 year | $2^{14}$ | -0.02 (2) | -0.02 (2) | 4.86 (2) | 4.86 (2) |
| 1 year | $2^{17}$ | -0.00 (3) | -0.01 (4) | 35.06 (3) | 35.06 (4) |
| 1 year | $2^{20}$ | -0.00 (6) | -0.00 (9) | 182.61 (6) | 182.61 (9) |



Table 3: "2011 SSD" scenario: C = 5, D = 2.5, M = 1. Percentage improvement of migration over
checkpointing. Number of required spares in parentheses.

| $\mu$ | $N$ | Sequential jobs, $\varepsilon = 10^{-4}$ | Sequential jobs, $\varepsilon = 10^{-6}$ | Parallel jobs, $\varepsilon = 10^{-4}$ | Parallel jobs, $\varepsilon = 10^{-6}$ |
|---|---|---|---|---|---|
| 1 day | $2^{14}$ | -0.00 (65) | -0.10 (73) | 563.73 (65) | 563.40 (73) |
| 1 day | $2^{17}$ | 0.21 (386) | 0.17 (406) | 719.04 (386) | 718.91 (406) |
| 1 day | $2^{20}$ | 0.27 (2732) | 0.26 (2786) | 811.69 (2732) | 811.64 (2786) |
| 1 week | $2^{14}$ | -0.08 (16) | -0.13 (20) | 337.65 (16) | 337.55 (20) |
| 1 week | $2^{17}$ | 0.00 (73) | -0.01 (81) | 580.07 (73) | 580.03 (81) |
| 1 week | $2^{20}$ | 0.03 (437) | 0.03 (458) | 730.30 (437) | 730.28 (458) |
| 1 month | $2^{14}$ | -0.03 (3) | -0.06 (5) | 35.92 (3) | 35.90 (5) |
| 1 month | $2^{17}$ | -0.01 (8) | -0.01 (10) | 174.29 (8) | 174.28 (10) |
| 1 month | $2^{20}$ | -0.00 (27) | -0.00 (32) | 436.32 (27) | 436.32 (32) |
| 1 year | $2^{14}$ | -0.02 (2) | -0.02 (2) | 3.40 (2) | 3.40 (2) |
| 1 year | $2^{17}$ | -0.00 (3) | -0.01 (4) | 25.00 (3) | 25.00 (4) |
| 1 year | $2^{20}$ | -0.00 (6) | -0.00 (9) | 134.17 (6) | 134.17 (9) |

Table 4: "2011 Flash" scenario: C = 1.5, D = 2.5, M = 1. Percentage improvement of migration over
checkpointing. Number of required spares in parentheses.

| $\mu$ | $N$ | Sequential jobs, $\varepsilon = 10^{-4}$ | Sequential jobs, $\varepsilon = 10^{-6}$ | Parallel jobs, $\varepsilon = 10^{-4}$ | Parallel jobs, $\varepsilon = 10^{-6}$ |
|---|---|---|---|---|---|
| 1 day | $2^{14}$ | -0.48 (65) | -0.58 (73) | 245.48 (65) | 245.31 (73) |
| 1 day | $2^{17}$ | -0.28 (386) | -0.31 (406) | 306.01 (386) | 305.95 (406) |
| 1 day | $2^{20}$ | -0.21 (2732) | -0.22 (2786) | 339.91 (2732) | 339.89 (2786) |
| 1 week | $2^{14}$ | -0.15 (16) | -0.20 (20) | 150.13 (16) | 150.07 (20) |
| 1 week | $2^{17}$ | -0.07 (73) | -0.08 (81) | 252.08 (73) | 252.06 (81) |
| 1 week | $2^{20}$ | -0.04 (437) | -0.04 (458) | 310.19 (437) | 310.18 (458) |
| 1 month | $2^{14}$ | -0.04 (3) | -0.06 (5) | 14.76 (3) | 14.75 (5) |
| 1 month | $2^{17}$ | -0.01 (8) | -0.01 (10) | 76.90 (8) | 76.90 (10) |
| 1 month | $2^{20}$ | -0.00 (27) | -0.00 (32) | 192.75 (27) | 192.75 (32) |
| 1 year | $2^{14}$ | -0.02 (2) | -0.02 (2) | 1.33 (2) | 1.33 (2) |
| 1 year | $2^{17}$ | -0.00 (3) | -0.01 (4) | 10.15 (3) | 10.15 (4) |
| 1 year | $2^{20}$ | -0.00 (6) | -0.00 (9) | 58.63 (6) | 58.63 (9) |



5.4 Scenario "2011 Flash" 

• $C = R \in [1.5, 2]$

• $D \in [1.5, 5]$

• $M \in [0.5, 1.5]$ (64 GB on a 20 Gbps network)

Results are given in Table 4 for particular values in the above ranges.

5.5 Scenario "2011 Flash" + Faster Reboot

• $C = R \in [1.5, 2]$

• $D \in [0, 0.5]$

• $M \in [0.5, 1.5]$ (64 GB on a 20 Gbps network)

Results are given in Table 5 for particular values in the above ranges.

5.6 Scenario "2015"

• $C = R \in [0, 0.15]$

• $D \in [0, 0.5]$

• $M \in [0.5, 1.5]$ (128 GB on a 40 Gbps network)

Results are given in Table 6 for particular values in the above ranges.

5.7 Summary 

• Sequential jobs: forget migration 

• Parallel jobs: prefer migration, until checkpointing costs are dramatically reduced (in proportion to
migration costs)

Table 5: "2011 Flash + Faster Reboot" scenario: C = 1.5, D = 0.25, M = 1. Percentage improvement
of migration over checkpointing. Number of required spares in parentheses.

| $\mu$ | $N$ | Sequential jobs, $\varepsilon = 10^{-4}$ | Sequential jobs, $\varepsilon = 10^{-6}$ | Parallel jobs, $\varepsilon = 10^{-4}$ | Parallel jobs, $\varepsilon = 10^{-6}$ |
|---|---|---|---|---|---|
| 1 day | $2^{14}$ | -0.21 (30) | -0.27 (35) | 131.39 (30) | 131.32 (35) |
| 1 day | $2^{17}$ | -0.08 (155) | -0.10 (168) | 161.38 (155) | 161.35 (168) |
| 1 day | $2^{20}$ | -0.04 (1024) | -0.05 (1056) | 177.39 (1024) | 177.38 (1056) |
| 1 week | $2^{14}$ | -0.09 (9) | -0.12 (12) | 80.95 (9) | 80.92 (12) |
| 1 week | $2^{17}$ | -0.03 (33) | -0.04 (39) | 134.46 (33) | 134.45 (39) |
| 1 week | $2^{20}$ | -0.01 (174) | -0.01 (188) | 163.10 (174) | 163.10 (188) |
| 1 month | $2^{14}$ | -0.02 (2) | -0.04 (3) | 7.52 (2) | 7.51 (3) |
| 1 month | $2^{17}$ | -0.01 (5) | -0.01 (7) | 40.93 (5) | 40.93 (7) |
| 1 month | $2^{20}$ | -0.00 (14) | -0.00 (17) | 103.70 (14) | 103.70 (17) |
| 1 year | $2^{14}$ | -0.01 (1) | -0.02 (2) | 0.67 (1) | 0.66 (2) |
| 1 year | $2^{17}$ | -0.00 (2) | -0.00 (3) | 5.14 (2) | 5.14 (3) |
| 1 year | $2^{20}$ | -0.00 (4) | -0.00 (6) | 30.95 (4) | 30.95 (6) |



Table 6: "2015" scenario: C = 0.05, D = 0.25, M = 1. Percentage improvement of migration over
checkpointing. Number of required spares in parentheses.

| $\mu$ | $N$ | Sequential jobs, $\varepsilon = 10^{-4}$ | Sequential jobs, $\varepsilon = 10^{-6}$ | Parallel jobs, $\varepsilon = 10^{-4}$ | Parallel jobs, $\varepsilon = 10^{-6}$ |
|---|---|---|---|---|---|
| 1 day | $2^{14}$ | -0.41 (30) | -0.47 (35) | -47.52 (30) | -47.54 (35) |
| 1 day | $2^{17}$ | -0.28 (155) | -0.30 (168) | -55.58 (155) | -55.58 (168) |
| 1 day | $2^{20}$ | -0.24 (1024) | -0.25 (1056) | -58.81 (1024) | -58.81 (1056) |
| 1 week | $2^{14}$ | -0.12 (9) | -0.15 (12) | -28.92 (9) | -28.94 (12) |
| 1 week | $2^{17}$ | -0.06 (33) | -0.07 (39) | -48.25 (33) | -48.25 (39) |
| 1 week | $2^{20}$ | -0.04 (174) | -0.04 (188) | -55.84 (174) | -55.84 (188) |
| 1 month | $2^{14}$ | -0.02 (2) | -0.04 (3) | -2.25 (2) | -2.25 (3) |
| 1 month | $2^{17}$ | -0.01 (5) | -0.01 (7) | -13.47 (5) | -13.48 (7) |
| 1 month | $2^{20}$ | -0.00 (14) | -0.00 (17) | -37.62 (14) | -37.62 (17) |
| 1 year | $2^{14}$ | -0.01 (1) | -0.02 (2) | -0.20 (1) | -0.21 (2) |
| 1 year | $2^{17}$ | -0.00 (2) | -0.00 (3) | -1.52 (2) | -1.52 (3) |
| 1 year | $2^{20}$ | -0.00 (4) | -0.00 (6) | -9.91 (4) | -9.91 (6) |

6 Impact of failure prediction



In this section we deal with the case where no failure prediction is available. The idea is to checkpoint 
periodically. This raises two questions: 

1. How to determine the optimal period? 

2. What is the impact on platform throughput? 

Question 1 has received some attention in the literature for uni-processor jobs. Let $T$ be the period,
i.e. the time between two checkpoints, let $C$ be the checkpoint duration time, and $\mu$ the expected interval
time between failures. We compute $W$, the expected percentage of time lost, or "wasted", as in [6]:

$$W = \frac{C}{T} + \frac{T}{2\mu} \qquad (4)$$

The first term in the right-hand side of Equation (4) is by definition, because there are $C$ time-steps
devoted to checkpointing every $T$ time-steps. The second term accounts for the loss due to failures and
is explained as follows: every $\mu$ time-steps, a failure occurs, and we lose an average of $T/2$ time-steps.
Note that because the checkpoint and failure rates are independent, the quantity $T/2$ does not depend
upon the failure distribution (Poisson, Weibull, etc). $W$ is minimized for $T_{opt} = \sqrt{2C\mu}$. This is Young's
approximation [7]. The corresponding minimum waste is $W_{min} = \sqrt{2C/\mu}$.
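The optimal period and the associated waste follow from setting the derivative of the first-order waste to zero. A short worked derivation, using only the symbols defined above (the second derivative $2C/T^3$ is positive, so the stationary point is a minimum):

```latex
\[
  W(T) = \frac{C}{T} + \frac{T}{2\mu},
  \qquad
  \frac{dW}{dT} = -\frac{C}{T^{2}} + \frac{1}{2\mu} = 0
  \;\Longrightarrow\;
  T_{opt} = \sqrt{2C\mu},
\]
\[
  W_{min} = \frac{C}{\sqrt{2C\mu}} + \frac{\sqrt{2C\mu}}{2\mu}
          = 2\sqrt{\frac{C}{2\mu}}
          = \sqrt{\frac{2C}{\mu}} .
\]
```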




Equation (4) does not account for recovery time $R$ after each failure. A more accurate expression is
the following:

$$W = \frac{C}{T} + \frac{\frac{T}{2} + R + D}{\mu} \qquad (5)$$

Now in the right-hand side we state that every $\mu$ time-steps, a failure occurs, and we lose an average of
$\frac{T}{2} + R + D$ time-steps. $W$ is minimized for the same value $T_{opt} = \sqrt{2C\mu}$ as before, but the corresponding
minimum waste becomes

$$W_{min} = \sqrt{\frac{2C}{\mu}} + \frac{R + D}{\mu} \qquad (6)$$

Note that this is different from the first-order approximation given by Daly [3, equations (10) and
(12)] because we target the steady-state operation of the platform rather than the optimization of the
expected duration of a given job.

It turns out that $W_{min}$ may become larger than 1 when $\mu$ gets very small, a situation which is more
likely to happen with jobs requiring many processors. In that case the application is not progressing any
more. To solve for $W_{min} \le 1$ in Equation (6), we let $\nu = 1/\sqrt{\mu}$ and derive

$$\nu^2 (R + D) + \nu \sqrt{2C} - 1 \le 0.$$

We get $W_{min} \le 1$ if $\nu \le \nu_b$ (hence $\mu \ge 1/\nu_b^2$) with

$$\nu_b = \frac{-\sqrt{2C} + \sqrt{2C + 4(R + D)}}{2(R + D)}.$$

In all cases, the minimum waste is

$$\min(W_{min}, 1)$$

6.1 Independent jobs

We simply write that the throughput is

$$\rho = (1 - W_{min}) N$$

Table 7: Yield $\rho/N$ for C = D = R = 1. Parallel jobs with $p_1 = 0.25$.

| $N$ | Yield ($\mu$ = 1 month) | Yield ($\mu$ = 1 year) |
|---|---|---|
| $2^{8}$ | 90.8% | 97.5% |
| $2^{11}$ | 69.9% | 92.6% |
| $2^{14}$ | 13.5% | 76.3% |
| $2^{17}$ | 1.7% | 22.1% |
| $2^{20}$ | 0.2% | 2.8% |
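The waste model of this section and the throughput of Section 6.1 can be sketched as follows. The function names are hypothetical, all times share one unit, and `mu_threshold` assumes R + D > 0 so that the quadratic in ν has the positive root given above.

```python
import math


def min_waste(mu, ckpt, recov, down):
    """Equation (6), capped at 1: sqrt(2C/mu) + (R + D)/mu."""
    return min(math.sqrt(2.0 * ckpt / mu) + (recov + down) / mu, 1.0)


def mu_threshold(ckpt, recov, down):
    """Smallest mu with W_min <= 1, i.e. 1 / nu_b**2 (requires R + D > 0)."""
    rd = recov + down
    nu_b = (-math.sqrt(2.0 * ckpt) + math.sqrt(2.0 * ckpt + 4.0 * rd)) / (2.0 * rd)
    return 1.0 / nu_b ** 2


def rho_periodic_seq(n_nodes, mu, ckpt, recov, down):
    """Throughput of N independent jobs checkpointing every sqrt(2*C*mu)."""
    return (1.0 - min_waste(mu, ckpt, recov, down)) * n_nodes
```

Below the threshold returned by `mu_threshold`, the capped waste equals 1 and the throughput drops to zero: the platform spends all its time checkpointing and recovering.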



6.2 Parallel jobs 

We assume the same distribution of parallel jobs as in Section 4.1 and we keep the same notations $K$
(number of jobs), $\beta_k$ for $1 \le k \le Z = \log_2 N$ (number of jobs of size $2^k$), and $\mu_k$ (expected interval time
between failures for a job using $2^k$ processors).

With $2^k$ processors we use $\mu_k$ instead of $\mu$ in Equation (5) to derive the minimum waste $W_{min}(k)$,
capped at 1 as above. The throughput becomes

$$\rho = \sum_{k=0}^{Z} \beta_k \times 2^k \times (1 - W_{min}(k))$$
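The yield of periodic checkpointing for parallel jobs can be evaluated with a short script. This is a sketch under stated assumptions: failures are exponential (μ_k = μ / 2^k), the waste of each job size is capped at 1, the job mix of Section 4.1 is recomputed inline, and the function name `parallel_yield` is hypothetical.

```python
import math


def parallel_yield(z, mu, ckpt, recov, down, p1=0.25):
    """Yield rho / N for the job mix of Section 4.1 on N = 2**z processors.

    Uses W_min(k) = sqrt(2C/mu_k) + (R + D)/mu_k with mu_k = mu / 2**k,
    capped at 1 when a job size can no longer make progress.
    """
    alpha = [p1] + [(1.0 - p1) / z] * z
    k_jobs = 2 ** z / sum(2 ** j * a for j, a in enumerate(alpha))
    rho = 0.0
    for k, a in enumerate(alpha):
        mu_k = mu / 2 ** k
        waste = min(math.sqrt(2.0 * ckpt / mu_k) + (recov + down) / mu_k, 1.0)
        rho += a * k_jobs * 2 ** k * (1.0 - waste)
    return rho / 2 ** z
```

With C = D = R = 1 minute, μ = 1 month taken as 30 days (43,200 minutes) and N = 2^8, `parallel_yield(8, 43200.0, 1.0, 1.0, 1.0)` evaluates to approximately 0.908.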

6.3 Numerical Results 

Here is a typical result for parallel jobs:

• C = D = R = 1

• $\mu$ = 1 month or 1 year

• $p_1$ = 0.25

Results are given in Table 7 for particular values of N.

7 Conclusion 

New software/hardware techniques are needed in order to reduce checkpoint, recovery, and migration 
times. This is a condition for parallel jobs to execute at a satisfying rate on future massively parallel 
machines. 

As for migration, we point out another requirement, namely being able to rely on accurate failure 
predictions. 

Another direction is to design "self-fault-tolerant" algorithms (e.g. asynchronous iterative algorithms)
whose execution can progress in the presence of local faults. Also, replication techniques should be
investigated: despite the resource costs induced by duplicating the same tasks on different processors,
replication can dramatically increase the reliability of the whole application.

Most likely, parallel jobs will be deployed on large-scale machines through a mix of all previous
techniques (checkpointing, migration, replication, self-fault-tolerant variants).